1 Project brief

We want to investigate the avocado dataset, and, in particular, to model the AveragePrice of the avocados. Use the tools we’ve worked with this week in order to prepare your dataset and find appropriate predictors. Once you’ve built your model use the validation techniques discussed on Wednesday to evaluate it. Feel free to focus either on building an explanatory or a predictive model, or both if you are feeling energetic!

As part of the MVP, we want you not just to run the code but also to have a go at interpreting the results and to write your thinking in comments in your script.

Hints and tips

  • region may lead to many dummy variables. Think carefully about whether to include this variable or not (there is no one ‘right’ answer to this!)
  • Think about whether each variable is categorical or numerical. If categorical, make sure that the variable is represented as a factor.
  • We will not treat this data as a time series, so Date will not be needed in your models, but can you extract any useful features out of Date before you discard it?
  • If you want to build a predictive model, consider using either leaps or glmulti to help with this.
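
To get a feel for how many dummy variables a categorical predictor such as region adds, one option is to inspect the design matrix that lm() would build. The sketch below uses a made-up four-level factor, not the avocado regions; a factor with k levels contributes k - 1 dummy columns.

```r
# Toy data: the region names and prices are made up for illustration only.
toy <- data.frame(
  average_price = c(1.1, 1.3, 0.9, 1.5, 1.2, 1.0),
  region = factor(c("north", "south", "east", "west", "north", "east"))
)

# model.matrix() shows the columns lm() would actually fit.
design <- model.matrix(average_price ~ region, data = toy)

n_levels  <- nlevels(toy$region)
n_dummies <- ncol(design) - 1  # subtract the intercept column

n_levels
n_dummies
```

Running the same count on the real region column before modelling gives a concrete number to weigh against the sample size.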

1.1 Researching and preparing our data

Here is what we found when looking for background information on the ‘avocado’ data. We accept this information as reliable.

“The table represents weekly retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.”

Relevant info for understanding ‘obscure’ variable names:

  • AveragePrice - the average price of a single avocado
  • Region - the city or region of the observation, i.e. where the avocados were sold
  • Total Volume - total number of avocados sold
  • 4046 - total number of small avocados sold (PLU 4046)
  • 4225 - total number of medium avocados sold (PLU 4225)
  • 4770 - total number of large avocados sold (PLU 4770)

The average price recorded here is per avocado and is not related to bag size, so we can drop the bag variables. Although region may have an impact on price, we have decided to drop ‘region’ for manual model development and to keep it only for testing and automated model development.

The x1 variable records the week in which sales were recorded, in a 52-weeks-per-year format. Although our brief is not concerned with time series and forecasting, we can investigate whether seasonality has an impact on average price. Avocados are very sensitive to variations in temperature, so weather patterns may affect production and potentially prices. We have decided to keep only the data for 2015 - 2017, dropping the partial 2018 data. This should help, especially if seasons play some role in average price.
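
Because everything that follows treats x1 as a week-of-year counter, it may be worth confirming that against the date column before relying on it. A minimal sketch of such a check in base R, using a few made-up rows rather than the real file (the toy index is deliberately numbered from the latest date backwards, to show what a reversed counter would look like):

```r
# Toy rows (made up): four weekly dates with a hypothetical index column x1.
toy <- data.frame(
  x1   = 0:3,
  date = as.Date(c("2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06"))
)

# Week of the year derived from the date itself (00-53, weeks start on Sunday).
toy$week_from_date <- as.integer(format(toy$date, "%U"))

# A correlation near +1 means x1 counts forward through the year;
# near -1 means it counts backwards from the last recorded week.
index_vs_week <- cor(toy$x1, toy$week_from_date)
index_vs_week
```

On the real data the same comparison could be run within each year, since the index restarts every year.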

So, we’ll focus on average price, type and total volume. We’ll use x1, date and year to engineer variables which will enable us to explore seasonality.

One line conclusion: Weather, especially around October, can have an impact on supply which in turn will influence avocado prices.

Afterthoughts: Thinking carefully about the data and asking the right questions helps with variable engineering and, as a result, with modelling accuracy and outcomes. The ability to run multiple models in a short time supports this back-and-forth process and hopefully increases both the value and the understanding of the data, which may lead to informed quantitative decision making.

library(tidyverse)
library(janitor)
library(ggfortify)
library(GGally)
library(lubridate)
library(modelr)
library(skimr)

1.1.1 cleaning var names, subsetting

avocado_df_exp <- read_csv("data/avocado.csv") %>% 
  clean_names() %>% 
  select(x1:x4770, type:year) %>% 
  rename(week = "x1",
         small = "x4046",
         medium = "x4225",
         large = "x4770") %>% 
  filter(date <= as.Date("2017-12-31"))
## Warning: Missing column names filled in: 'X1' [1]
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   Date = col_date(format = ""),
##   AveragePrice = col_double(),
##   `Total Volume` = col_double(),
##   `4046` = col_double(),
##   `4225` = col_double(),
##   `4770` = col_double(),
##   `Total Bags` = col_double(),
##   `Small Bags` = col_double(),
##   `Large Bags` = col_double(),
##   `XLarge Bags` = col_double(),
##   type = col_character(),
##   year = col_double(),
##   region = col_character()
## )
avocado_tidy <- avocado_df_exp %>%
  mutate(month = as.character(month(date))) %>% 
  mutate(season = case_when( 
           month == "12" | month == "1" | month == "2" ~ "winter",
           month == "3" | month == "4" | month == "5" ~ "spring",
           month == "6" | month == "7" | month == "8" ~ "summer",
           month == "9" | month == "10" | month == "11" ~ "autumn")
         ) %>% 
  mutate(type = as.factor(type)) %>% 
  mutate(season = as.factor(season)) %>% 
  mutate(year = as.factor(year)) %>% 
  #mutate(week = as.factor(week))
  select(-date)
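
A side effect of storing month as character is that R orders its levels alphabetically, which is why month10 comes straight after the intercept in the model output further down. The sketch below, on a few made-up dates, contrasts character months with a factor that has explicit calendar-order levels, and shows a lookup-vector alternative to the case_when() season mapping. It is an illustration only; the models in this notebook use the character version.

```r
# Made-up dates for illustration.
dates <- as.Date(c("2015-01-04", "2015-02-01", "2015-10-04", "2015-12-27"))

# Month number as an integer (1-12), taken straight from the date.
month_num <- as.integer(format(dates, "%m"))

# Character months sort alphabetically; a factor with explicit levels
# keeps calendar order.
month_chr <- as.character(month_num)
month_fct <- factor(month_num, levels = 1:12)

chr_order <- sort(unique(month_chr))
fct_order <- levels(droplevels(month_fct))
chr_order
fct_order

# Season lookup indexed by month number (position 1 = January).
season_lookup <- c("winter", "winter", "spring", "spring", "spring", "summer",
                   "summer", "summer", "autumn", "autumn", "autumn", "winter")
season <- factor(season_lookup[month_num])
season
```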

We expect total volume to be strongly correlated with the avocado size variables, so we test this and, if confirmed, drop the size variables.

avocado_tidy %>% 
  select(total_volume:large) %>% 
  ggpairs()

It is clear that we can use total volume as the ‘size’ variable in our analysis.
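
The ggpairs() plot can be backed up with the correlation matrix itself. A hedged sketch on simulated volumes (the numbers are generated, not taken from the avocado data) shows the kind of output to look for: correlations with total_volume close to 1 suggest the size columns add little beyond the total.

```r
# Simulated volumes, for illustration only.
set.seed(1)
small  <- runif(50, 100, 1000)
medium <- small * 1.2 + rnorm(50, sd = 20)
large  <- small * 0.1 + rnorm(50, sd = 5)
toy <- data.frame(total_volume = small + medium + large, small, medium, large)

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(toy), 2)

# Row of interest: how each size column relates to the total.
cor_mat["total_volume", ]
```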

avocado_tidy <- avocado_tidy %>% 
  select(-c(small, medium, large))

Let’s look at summary statistics. We’ll employ both summary() and skim() functions to compare their different output formats.

summary(avocado_tidy)
##       week       average_price   total_volume                type     
##  Min.   : 0.00   Min.   :0.44   Min.   :      85   conventional:8478  
##  1st Qu.:13.00   1st Qu.:1.10   1st Qu.:   10460   organic     :8475  
##  Median :26.00   Median :1.37   Median :  104849                      
##  Mean   :25.66   Mean   :1.41   Mean   :  834110                      
##  3rd Qu.:39.00   3rd Qu.:1.67   3rd Qu.:  423186                      
##  Max.   :52.00   Max.   :3.25   Max.   :61034457                      
##    year         month              season    
##  2015:5615   Length:16953       autumn:4212  
##  2016:5616   Class :character   spring:4320  
##  2017:5722   Mode  :character   summer:4210  
##                                 winter:4211  
##                                              
## 
avocado_tidy %>% 
  skim()
Data summary

Name                      Piped data
Number of rows            16953
Number of columns         7
Column type frequency:
  character               1
  factor                  3
  numeric                 3
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
month                  0              1    1    2      0        12           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
type                   0              1  FALSE           2  con: 8478, org: 8475
year                   0              1  FALSE           3  201: 5722, 201: 5616, 201: 5615
season                 0              1  FALSE           4  spr: 4320, aut: 4212, win: 4211, sum: 4210

Variable type: numeric

skim_variable  n_missing  complete_rate       mean          sd     p0       p25        p50        p75         p100  hist
week                   0              1      25.66       15.11   0.00     13.00      26.00      39.00        52.00  ▇▇▇▇▇
average_price          0              1       1.41        0.41   0.44      1.10       1.37       1.67         3.25  ▂▇▅▁▁
total_volume           0              1  834109.85  3381120.41  84.56  10459.56  104849.39  423186.06  61034457.10  ▇▁▁▁▁

‘total_volume’ is extremely skewed, which may affect our models. We need to look into this.
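
One way to quantify the skew is to compare the mean with the median: the summary above gives a mean of 834110 against a median of 104849. The sketch below does that arithmetic and then uses simulated right-skewed data (an assumption for illustration, not the avocado data) to show how a log transform pulls the two together, which is why the models below use log(total_volume).

```r
# Figures copied from the summary() output above.
mean_vol   <- 834110
median_vol <- 104849
mean_to_median <- mean_vol / median_vol
mean_to_median

# Simulated right-skewed data (not the avocado data).
set.seed(42)
x <- rlnorm(1000, meanlog = 11, sdlog = 2)

# Ratio of mean to median before and after taking logs;
# a ratio near 1 indicates a roughly symmetric distribution.
ratio_raw <- mean(x) / median(x)
ratio_log <- mean(log(x)) / median(log(x))
c(raw = ratio_raw, log = ratio_log)
```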

total_vol_by_type <- avocado_tidy %>% 
  group_by(type) %>% 
  summarise(avg_total_vol= mean(total_volume)) %>%
  mutate(pct = prop.table(avg_total_vol) * 100)

total_vol_by_type

More than 97% of the avocados sold in the data are conventional, so it makes sense to focus on this type for average price modelling (for comparison, we have provided manual modelling in a separate notebook).
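
Note that pct is computed from the mean total_volume per type, so it is a share of volume rather than a share of rows; the summary shows that the two types have almost the same number of rows (8478 and 8475). A small sketch with made-up volumes illustrates the difference:

```r
# Made-up volumes: equal numbers of rows per type, very different volumes.
toy <- data.frame(
  type = c("conventional", "conventional", "organic", "organic"),
  total_volume = c(900000, 1040000, 20000, 40000)
)

# Share of volume, mirroring the summarise() + prop.table() step above.
avg_vol    <- tapply(toy$total_volume, toy$type, mean)
pct_volume <- prop.table(avg_vol) * 100

# Share of rows, for contrast.
pct_rows <- prop.table(table(toy$type)) * 100

pct_volume
pct_rows
```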

avocado_tidy_conv <- avocado_tidy %>% 
  filter(type == "conventional") %>% 
  select(-type)
avocado_tidy_org <- avocado_tidy %>% 
  filter(type == "organic") %>% 
  select(-type)

1.2 Visualising our data

both_types <- ggplot(avocado_tidy) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases",
      subtitle = "both types") +
 theme_minimal()

conventional <- ggplot(avocado_tidy_conv) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases",
      subtitle = "conventional") +
 theme_minimal()

organic <- ggplot(avocado_tidy_org) +
 aes(x = total_volume, y = average_price) +
 geom_point(size = 1L, colour = "#0c4c8a") +
 geom_smooth(span = 0.75) +
 scale_x_continuous(trans = "log") +
 scale_y_continuous(trans = "log") +
 labs(title = "Average price decreases when Total Volume increases",
      subtitle = "organic") +
 theme_minimal()

both_types
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

conventional
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

organic
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(avocado_df_exp) +
 aes(x = date, y = average_price, colour = type) +
 geom_line(size = 1L) +
 scale_color_hue() +
 labs(title = "Average Price has a certain degree of seasonality") +
 theme_minimal() +
 facet_wrap(vars(type))

ggplot(avocado_df_exp) +
 aes(x = type, y = average_price, fill = type) +
 geom_boxplot() +
 scale_fill_hue() +
 labs(title = "As expected average price is higher for organic type") +
 theme_minimal()

ggplot(avocado_df_exp) +
 aes(x = date, weight = total_volume) +
 geom_bar(fill = "#0c4c8a") +
 labs(title = "Total Volume also shows a seasonal pattern") +
 theme_minimal()

1.3 Model development

1.3.1 First Predictor - month

avocado_tidy_conv %>% 
   ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.1.1 average price + total volume

mod_total_volume <- lm(average_price ~ log(total_volume), data = avocado_tidy_conv)
mod_total_volume
## 
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy_conv)
## 
## Coefficients:
##       (Intercept)  log(total_volume)  
##           1.75676           -0.04544
summary(mod_total_volume)
## 
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65584 -0.18583 -0.02832  0.15142  1.06102 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.756763   0.027874   63.03   <2e-16 ***
## log(total_volume) -0.045444   0.002113  -21.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2599 on 8476 degrees of freedom
## Multiple R-squared:  0.05175,    Adjusted R-squared:  0.05164 
## F-statistic: 462.6 on 1 and 8476 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_total_volume)
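
Because total_volume enters on the log scale, the slope is easiest to read in terms of multiplicative changes in volume. A short worked calculation using the estimate from the summary above: doubling volume shifts the fitted average price by the slope times log(2), roughly -0.03.

```r
# Slope copied from summary(mod_total_volume) above.
slope <- -0.045444

# With a log() predictor, multiplying volume by a factor k shifts the
# fitted price by slope * log(k).
change_if_doubled <- slope * log(2)
change_if_tenfold <- slope * log(10)

round(c(doubled = change_if_doubled, tenfold = change_if_tenfold), 4)
```

The effect is statistically clear but small in size, which is consistent with the low R-squared of about 0.05 for this model.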

1.3.1.2 average price + month

mod_month <- lm(average_price ~ month, data = avocado_tidy_conv)
mod_month
## 
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy_conv)
## 
## Coefficients:
## (Intercept)      month10      month11      month12       month2       month3  
##     1.03694      0.31239      0.16911      0.04045     -0.03765      0.08790  
##      month4       month5       month6       month7       month8       month9  
##     0.10541      0.05263      0.11225      0.17554      0.19845      0.25779
summary(mod_month)
## 
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71474 -0.17605 -0.02249  0.16751  0.87066 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.036944   0.009007 115.125  < 2e-16 ***
## month10      0.312394   0.012738  24.525  < 2e-16 ***
## month11      0.169110   0.012981  13.028  < 2e-16 ***
## month12      0.040449   0.012981   3.116  0.00184 ** 
## month2      -0.037654   0.013258  -2.840  0.00452 ** 
## month3       0.087899   0.012981   6.772 1.36e-11 ***
## month4       0.105406   0.012981   8.120 5.31e-16 ***
## month5       0.052632   0.012738   4.132 3.63e-05 ***
## month6       0.112253   0.013258   8.467  < 2e-16 ***
## month7       0.175542   0.012738  13.781  < 2e-16 ***
## month8       0.198454   0.012981  15.288  < 2e-16 ***
## month9       0.257793   0.013258  19.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2477 on 8466 degrees of freedom
## Multiple R-squared:  0.1397, Adjusted R-squared:  0.1386 
## F-statistic:   125 on 11 and 8466 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month)

1.3.1.3 average price + week

mod_week <- lm(average_price ~ week, data = avocado_tidy_conv)
mod_week
## 
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy_conv)
## 
## Coefficients:
## (Intercept)         week  
##    1.263930    -0.004035
summary(mod_week)
## 
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77393 -0.18287 -0.02453  0.16081  1.00450 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.2639300  0.0055623  227.23   <2e-16 ***
## week        -0.0040355  0.0001867  -21.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2598 on 8476 degrees of freedom
## Multiple R-squared:  0.05221,    Adjusted R-squared:  0.0521 
## F-statistic:   467 on 1 and 8476 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_week)

1.3.1.4 average price + season

mod_season <- lm(average_price ~ season, data = avocado_tidy_conv)
mod_season
## 
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy_conv)
## 
## Coefficients:
##  (Intercept)  seasonspring  seasonsummer  seasonwinter  
##      1.28478      -0.16659      -0.08413      -0.24594
summary(mod_season)
## 
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70478 -0.17819 -0.02065  0.16181  0.93522 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.284777   0.005463  235.18   <2e-16 ***
## seasonspring -0.166587   0.007677  -21.70   <2e-16 ***
## seasonsummer -0.084126   0.007726  -10.89   <2e-16 ***
## seasonwinter -0.245935   0.007726  -31.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2507 on 8474 degrees of freedom
## Multiple R-squared:  0.1176, Adjusted R-squared:  0.1173 
## F-statistic: 376.3 on 3 and 8474 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_season)

1.3.1.5 average price + year

mod_year <- lm(average_price ~ year, data = avocado_tidy_conv)
mod_year
## 
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy_conv)
## 
## Coefficients:
## (Intercept)     year2016     year2017  
##     1.07796      0.02763      0.21693
summary(mod_year)
## 
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83489 -0.15559 -0.00559  0.15441  1.09441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.077963   0.004694 229.663  < 2e-16 ***
## year2016    0.027632   0.006638   4.163 3.18e-05 ***
## year2017    0.216925   0.006606  32.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2487 on 8475 degrees of freedom
## Multiple R-squared:  0.1314, Adjusted R-squared:  0.1312 
## F-statistic: 640.8 on 2 and 8475 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_year)
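
Collecting the adjusted R-squared values from the five single-predictor summaries makes the choice of first predictor explicit. The first part of the sketch simply copies those values from the output above; the second part shows how the same figure could be pulled from fitted models programmatically, using the built-in mtcars data as a stand-in so the snippet runs on its own.

```r
# Adjusted R-squared values copied from the five summaries above.
adj_r2 <- c(
  total_volume = 0.05164,
  month        = 0.1386,
  week         = 0.0521,
  season       = 0.1173,
  year         = 0.1312
)
sort(adj_r2, decreasing = TRUE)
best_single <- names(which.max(adj_r2))
best_single

# Same idea for fitted models; mtcars is a stand-in, not the avocado data.
fits <- list(
  wt = lm(mpg ~ wt, data = mtcars),
  hp = lm(mpg ~ hp, data = mtcars)
)
adj_r2_fits <- sapply(fits, function(m) summary(m)$adj.r.squared)
adj_r2_fits
```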

1.3.2 Second Predictor - year

remaining_resid <- avocado_tidy_conv %>% 
  add_residuals(mod_month) %>% 
  select(-c(average_price, month))
remaining_resid %>% 
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.2.1 avg_p + month + total_volume

mod_month_total_volume <- lm(average_price ~ month + log(total_volume), data = avocado_tidy_conv)
mod_month_total_volume
## 
## Call:
## lm(formula = average_price ~ month + log(total_volume), data = avocado_tidy_conv)
## 
## Coefficients:
##       (Intercept)            month10            month11            month12  
##           1.58425            0.30265            0.15852            0.03508  
##            month2             month3             month4             month5  
##          -0.03444            0.08558            0.10551            0.05772  
##            month6             month7             month8             month9  
##           0.11515            0.17562            0.19540            0.25206  
## log(total_volume)  
##          -0.04154
summary(mod_month_total_volume)
## 
## Call:
## lm(formula = average_price ~ month + log(total_volume), data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66874 -0.17550 -0.02301  0.15789  0.87952 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.58425    0.02741  57.807  < 2e-16 ***
## month10            0.30265    0.01243  24.357  < 2e-16 ***
## month11            0.15852    0.01266  12.518  < 2e-16 ***
## month12            0.03508    0.01266   2.772  0.00558 ** 
## month2            -0.03444    0.01293  -2.665  0.00772 ** 
## month3             0.08558    0.01265   6.763 1.44e-11 ***
## month4             0.10551    0.01265   8.338  < 2e-16 ***
## month5             0.05772    0.01242   4.648 3.41e-06 ***
## month6             0.11515    0.01293   8.909  < 2e-16 ***
## month7             0.17563    0.01242  14.144  < 2e-16 ***
## month8             0.19540    0.01265  15.442  < 2e-16 ***
## month9             0.25206    0.01293  19.499  < 2e-16 ***
## log(total_volume) -0.04154    0.00197 -21.082  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2414 on 8465 degrees of freedom
## Multiple R-squared:  0.1826, Adjusted R-squared:  0.1815 
## F-statistic: 157.6 on 12 and 8465 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_total_volume)

1.3.2.2 avg_p + month + year

mod_month_year <- lm(average_price ~ month + year, data = avocado_tidy_conv)
mod_month_year
## 
## Call:
## lm(formula = average_price ~ month + year, data = avocado_tidy_conv)
## 
## Coefficients:
## (Intercept)      month10      month11      month12       month2       month3  
##     0.94914      0.31239      0.18127      0.03583     -0.03180      0.10006  
##      month4       month5       month6       month7       month8       month9  
##     0.10078      0.06821      0.11811      0.17554      0.21061      0.26365  
##    year2016     year2017  
##     0.02771      0.21814
summary(mod_month_year)
## 
## Call:
## lm(formula = average_price ~ month + year, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74855 -0.15728 -0.00253  0.14710  0.91188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.949141   0.009103 104.264  < 2e-16 ***
## month10      0.312394   0.011720  26.656  < 2e-16 ***
## month11      0.181267   0.011954  15.163  < 2e-16 ***
## month12      0.035826   0.011946   2.999  0.00272 ** 
## month2      -0.031801   0.012201  -2.606  0.00916 ** 
## month3       0.100056   0.011954   8.370  < 2e-16 ***
## month4       0.100783   0.011946   8.437  < 2e-16 ***
## month5       0.068214   0.011728   5.817 6.23e-09 ***
## month6       0.118107   0.012201   9.680  < 2e-16 ***
## month7       0.175542   0.011720  14.979  < 2e-16 ***
## month8       0.210612   0.011954  17.618  < 2e-16 ***
## month9       0.263647   0.012201  21.609  < 2e-16 ***
## year2016     0.027709   0.006094   4.547 5.52e-06 ***
## year2017     0.218141   0.006071  35.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2279 on 8464 degrees of freedom
## Multiple R-squared:  0.2719, Adjusted R-squared:  0.2708 
## F-statistic: 243.2 on 13 and 8464 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_year)
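
Adding year lifts adjusted R-squared from 0.1386 to 0.2708. A formal check that a larger nested model improves on a smaller one is an F-test via anova(). The sketch below shows the pattern on the built-in mtcars data as a stand-in; with the notebook's objects the equivalent call would be anova(mod_month, mod_month_year).

```r
# Stand-in models on mtcars: the smaller model is nested in the larger one.
base_fit   <- lm(mpg ~ factor(cyl), data = mtcars)
bigger_fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)

# F-test comparing the two fits; row 2 refers to the larger model.
cmp <- anova(base_fit, bigger_fit)
cmp

p_value <- cmp[["Pr(>F)"]][2]
p_value
```

A small p-value indicates that the extra predictor explains more variation than would be expected by chance.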

1.3.2.3 avg_p + month + week

mod_month_week <- lm(average_price ~ month + week, data = avocado_tidy_conv)
mod_month_week
## 
## Call:
## lm(formula = average_price ~ month + week, data = avocado_tidy_conv)
## 
## Coefficients:
## (Intercept)      month10      month11      month12       month2       month3  
##    0.645495     0.620809     0.513111     0.418515    -0.003386     0.155117  
##      month4       month5       month6       month7       month8       month9  
##    0.206690     0.189894     0.283595     0.381152     0.439650     0.531940  
##        week  
##    0.007908
summary(mod_month_week)
## 
## Call:
## lm(formula = average_price ~ month + week, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72396 -0.17540 -0.02266  0.16719  0.87043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.645495   0.102075   6.324 2.68e-10 ***
## month10      0.620809   0.081114   7.654 2.17e-14 ***
## month11      0.513111   0.090289   5.683 1.37e-08 ***
## month12      0.418515   0.099054   4.225 2.41e-05 ***
## month2      -0.003386   0.015960  -0.212 0.831989    
## month3       0.155117   0.021750   7.132 1.07e-12 ***
## month4       0.206690   0.029332   7.047 1.98e-12 ***
## month5       0.189894   0.037857   5.016 5.38e-07 ***
## month6       0.283595   0.046435   6.107 1.06e-09 ***
## month7       0.381152   0.054902   6.942 4.14e-12 ***
## month8       0.439650   0.063978   6.872 6.78e-12 ***
## month9       0.531940   0.072430   7.344 2.26e-13 ***
## week         0.007908   0.002054   3.850 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2475 on 8465 degrees of freedom
## Multiple R-squared:  0.1412, Adjusted R-squared:   0.14 
## F-statistic:   116 on 12 and 8465 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_week)

1.3.2.4 avg_p + month + season

mod_month_season <- lm(average_price ~ month + season, data = avocado_tidy_conv)
mod_month_season
## 
## Call:
## lm(formula = average_price ~ month + season, data = avocado_tidy_conv)
## 
## Coefficients:
##  (Intercept)       month10       month11       month12        month2  
##      1.03694       0.31239       0.16911       0.04045      -0.03765  
##       month3        month4        month5        month6        month7  
##      0.08790       0.10541       0.05263       0.11225       0.17554  
##       month8        month9  seasonspring  seasonsummer  seasonwinter  
##      0.19845       0.25779            NA            NA            NA
summary(mod_month_season)
## 
## Call:
## lm(formula = average_price ~ month + season, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71474 -0.17605 -0.02249  0.16751  0.87066 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.036944   0.009007 115.125  < 2e-16 ***
## month10       0.312394   0.012738  24.525  < 2e-16 ***
## month11       0.169110   0.012981  13.028  < 2e-16 ***
## month12       0.040449   0.012981   3.116  0.00184 ** 
## month2       -0.037654   0.013258  -2.840  0.00452 ** 
## month3        0.087899   0.012981   6.772 1.36e-11 ***
## month4        0.105406   0.012981   8.120 5.31e-16 ***
## month5        0.052632   0.012738   4.132 3.63e-05 ***
## month6        0.112253   0.013258   8.467  < 2e-16 ***
## month7        0.175542   0.012738  13.781  < 2e-16 ***
## month8        0.198454   0.012981  15.288  < 2e-16 ***
## month9        0.257793   0.013258  19.444  < 2e-16 ***
## seasonspring        NA         NA      NA       NA    
## seasonsummer        NA         NA      NA       NA    
## seasonwinter        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2477 on 8466 degrees of freedom
## Multiple R-squared:  0.1397, Adjusted R-squared:  0.1386 
## F-statistic:   125 on 11 and 8466 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_season)
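
The NA coefficients appear because season is computed directly from month, so the season dummies are exact linear combinations of the month dummies and carry no extra information. A minimal sketch reproducing this on made-up data (one month per season, for brevity):

```r
# Made-up prices; month is a factor with one level per season.
toy <- data.frame(
  price = c(1.0, 1.1, 1.3, 1.2, 1.5, 1.4, 1.1, 1.0),
  month = factor(c(1, 1, 4, 4, 7, 7, 10, 10))
)

# season is fully determined by month, as in the notebook.
season_of <- c("1" = "winter", "4" = "spring", "7" = "summer", "10" = "autumn")
toy$season <- factor(season_of[as.character(toy$month)])

# lm() cannot estimate the redundant season terms and reports them as NA.
fit <- lm(price ~ month + season, data = toy)
coef(fit)

n_dropped <- sum(is.na(coef(fit)))
n_dropped
```

This is also why the fit statistics of the month + season model are identical to those of the month-only model.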

1.3.3 Third Predictor - total_volume

remaining_resid <- avocado_tidy_conv %>% 
  add_residuals(mod_month_year) %>% 
  select(-c(average_price, month, year))
remaining_resid %>% 
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.3.3.1 avg_p + month + year + total volume

mod_month_year_total_volume <- lm(average_price ~ month + year + log(total_volume), data = avocado_tidy_conv)
mod_month_year_total_volume
## 
## Call:
## lm(formula = average_price ~ month + year + log(total_volume), 
##     data = avocado_tidy_conv)
## 
## Coefficients:
##       (Intercept)            month10            month11            month12  
##           1.51701            0.30222            0.17068            0.03033  
##            month2             month3             month4             month5  
##          -0.02822            0.09810            0.10099            0.07387  
##            month6             month7             month8             month9  
##           0.12135            0.17563            0.20790            0.25789  
##          year2016           year2017  log(total_volume)  
##           0.03238            0.22297           -0.04336
summary(mod_month_year_total_volume)
## 
## Call:
## lm(formula = average_price ~ month + year + log(total_volume), 
##     data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69624 -0.15698 -0.00183  0.14641  0.91822 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.517006   0.025174  60.262  < 2e-16 ***
## month10            0.302221   0.011346  26.636  < 2e-16 ***
## month11            0.170679   0.011574  14.747  < 2e-16 ***
## month12            0.030325   0.011559   2.623  0.00872 ** 
## month2            -0.028222   0.011805  -2.391  0.01683 *  
## month3             0.098104   0.011566   8.482  < 2e-16 ***
## month4             0.100985   0.011557   8.738  < 2e-16 ***
## month5             0.073872   0.011348   6.509 7.97e-11 ***
## month6             0.121354   0.011805  10.280  < 2e-16 ***
## month7             0.175628   0.011338  15.490  < 2e-16 ***
## month8             0.207896   0.011566  17.975  < 2e-16 ***
## month9             0.257891   0.011806  21.844  < 2e-16 ***
## year2016           0.032376   0.005899   5.488 4.17e-08 ***
## year2017           0.222975   0.005877  37.938  < 2e-16 ***
## log(total_volume) -0.043357   0.001801 -24.080  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2204 on 8463 degrees of freedom
## Multiple R-squared:  0.3186, Adjusted R-squared:  0.3175 
## F-statistic: 282.7 on 14 and 8463 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_year_total_volume)
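
The brief asks for the model to be validated. Assuming k-fold cross-validation is one of the intended techniques, a base R sketch of the idea looks like the following. It uses the built-in mtcars data and a hypothetical two-predictor model as stand-ins; for this notebook the formula and data would be those of mod_month_year_total_volume.

```r
set.seed(2024)
k <- 4

# Randomly assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, predict the held-out fold,
# and record the root mean squared error of those predictions.
rmse_per_fold <- sapply(seq_len(k), function(i) {
  train <- mtcars[fold_id != i, ]
  test  <- mtcars[fold_id == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

cv_rmse <- mean(rmse_per_fold)
rmse_per_fold
cv_rmse
```

Comparing the cross-validated error with the residual standard error from summary() gives a rough indication of how much the in-sample fit flatters the model.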

1.3.3.2 avg_p + month + year + season

mod_month_year_season <- lm(average_price ~ month + year + season, data = avocado_tidy_conv)
mod_month_year_season
## 
## Call:
## lm(formula = average_price ~ month + year + season, data = avocado_tidy_conv)
## 
## Coefficients:
##  (Intercept)       month10       month11       month12        month2  
##      0.94914       0.31239       0.18127       0.03583      -0.03180  
##       month3        month4        month5        month6        month7  
##      0.10006       0.10078       0.06821       0.11811       0.17554  
##       month8        month9      year2016      year2017  seasonspring  
##      0.21061       0.26365       0.02771       0.21814            NA  
## seasonsummer  seasonwinter  
##           NA            NA
summary(mod_month_year_season)
## 
## Call:
## lm(formula = average_price ~ month + year + season, data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74855 -0.15728 -0.00253  0.14710  0.91188 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.949141   0.009103 104.264  < 2e-16 ***
## month10       0.312394   0.011720  26.656  < 2e-16 ***
## month11       0.181267   0.011954  15.163  < 2e-16 ***
## month12       0.035826   0.011946   2.999  0.00272 ** 
## month2       -0.031801   0.012201  -2.606  0.00916 ** 
## month3        0.100056   0.011954   8.370  < 2e-16 ***
## month4        0.100783   0.011946   8.437  < 2e-16 ***
## month5        0.068214   0.011728   5.817 6.23e-09 ***
## month6        0.118107   0.012201   9.680  < 2e-16 ***
## month7        0.175542   0.011720  14.979  < 2e-16 ***
## month8        0.210612   0.011954  17.618  < 2e-16 ***
## month9        0.263647   0.012201  21.609  < 2e-16 ***
## year2016      0.027709   0.006094   4.547 5.52e-06 ***
## year2017      0.218141   0.006071  35.929  < 2e-16 ***
## seasonspring        NA         NA      NA       NA    
## seasonsummer        NA         NA      NA       NA    
## seasonwinter        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2279 on 8464 degrees of freedom
## Multiple R-squared:  0.2719, Adjusted R-squared:  0.2708 
## F-statistic: 243.2 on 13 and 8464 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2)) 
plot(mod_month_year_season)

1.3.4 Interactions

average_price_residual <- avocado_tidy_conv %>% 
  add_residuals(mod_month_year_total_volume) %>% 
  select(-average_price)
coplot(resid ~ log(total_volume) | month,
       panel = function(x, y, ...){
         points(x, y)
         abline(lm(y ~ x), col = "blue")
       },
       data = average_price_residual, columns=6)

average_price_residual %>%
  ggplot(aes(x = total_volume, y = resid, colour = season)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
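
In an R formula, a:b adds only the interaction terms, while a * b is shorthand for a + b + a:b. For two factors with m and n levels the interaction contributes (m - 1)(n - 1) coefficients, so month:year with 12 months and 3 years adds 22. A small sketch on the built-in mtcars data (a stand-in, not the avocado data) shows both points:

```r
# cyl has 3 levels and am has 2, so the interaction adds (3 - 1) * (2 - 1) = 2 terms.
main_only <- lm(mpg ~ factor(cyl) + factor(am), data = mtcars)
with_int  <- lm(mpg ~ factor(cyl) + factor(am) + factor(cyl):factor(am),
                data = mtcars)
shorthand <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)

# Coefficients present only in the interaction model.
extra_terms <- setdiff(names(coef(with_int)), names(coef(main_only)))
extra_terms

# The explicit and shorthand formulas give the same fit.
same_model <- isTRUE(all.equal(coef(with_int), coef(shorthand)))
same_model

# Interaction terms implied by 12 months and 3 years.
n_month_year_terms <- (12 - 1) * (3 - 1)
n_month_year_terms
```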

1.3.4.1 month - year

mod_interaction1 <- lm(average_price ~ month + year + total_volume + month:year, data = avocado_tidy_conv)
summary(mod_interaction1)
## 
## Call:
## lm(formula = average_price ~ month + year + total_volume + month:year, 
##     data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72198 -0.12933  0.00136  0.13026  0.84243 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.100e+00  1.413e-02  77.865  < 2e-16 ***
## month10          -2.875e-02  1.995e-02  -1.441 0.149742    
## month11          -6.802e-02  1.893e-02  -3.593 0.000328 ***
## month12          -8.672e-02  1.995e-02  -4.346 1.40e-05 ***
## month2           -3.606e-02  1.995e-02  -1.807 0.070737 .  
## month3           -2.179e-03  1.893e-02  -0.115 0.908348    
## month4            2.500e-02  1.995e-02   1.253 0.210341    
## month5           -3.090e-04  1.893e-02  -0.016 0.986978    
## month6           -4.434e-03  1.995e-02  -0.222 0.824159    
## month7            2.237e-02  1.995e-02   1.121 0.262254    
## month8            2.500e-02  1.893e-02   1.321 0.186621    
## month9           -1.688e-02  1.995e-02  -0.846 0.397537    
## year2016         -1.051e-01  1.893e-02  -5.550 2.95e-08 ***
## year2017         -4.635e-02  1.893e-02  -2.449 0.014364 *  
## total_volume     -5.223e-09  4.852e-10 -10.766  < 2e-16 ***
## month10:year2016  3.939e-01  2.677e-02  14.713  < 2e-16 ***
## month11:year2016  4.574e-01  2.677e-02  17.084  < 2e-16 ***
## month12:year2016  1.781e-01  2.750e-02   6.475 9.99e-11 ***
## month2:year2016  -1.139e-02  2.750e-02  -0.414 0.678800    
## month3:year2016   1.775e-02  2.677e-02   0.663 0.507346    
## month4:year2016  -4.940e-02  2.750e-02  -1.796 0.072543 .  
## month5:year2016  -4.784e-02  2.602e-02  -1.839 0.065989 .  
## month6:year2016   9.171e-02  2.750e-02   3.334 0.000859 ***
## month7:year2016   1.769e-01  2.677e-02   6.606 4.19e-11 ***
## month8:year2016   1.685e-01  2.677e-02   6.296 3.21e-10 ***
## month9:year2016   2.184e-01  2.750e-02   7.942 2.25e-15 ***
## month10:year2017  5.555e-01  2.677e-02  20.749  < 2e-16 ***
## month11:year2017  2.823e-01  2.677e-02  10.543  < 2e-16 ***
## month12:year2017  1.751e-01  2.677e-02   6.540 6.52e-11 ***
## month2:year2017  -1.268e-03  2.750e-02  -0.046 0.963228    
## month3:year2017   2.489e-01  2.677e-02   9.298  < 2e-16 ***
## month4:year2017   2.382e-01  2.677e-02   8.899  < 2e-16 ***
## month5:year2017   2.367e-01  2.677e-02   8.843  < 2e-16 ***
## month6:year2017   2.490e-01  2.751e-02   9.051  < 2e-16 ***
## month7:year2017   2.514e-01  2.677e-02   9.389  < 2e-16 ***
## month8:year2017   3.683e-01  2.677e-02  13.756  < 2e-16 ***
## month9:year2017   5.908e-01  2.751e-02  21.479  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2074 on 8441 degrees of freedom
## Multiple R-squared:  0.3986, Adjusted R-squared:  0.396 
## F-statistic: 155.4 on 36 and 8441 DF,  p-value: < 2.2e-16
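
The summary above tests each month:year coefficient separately. One way to test the interaction as a whole is a nested-model F-test with anova(), which compares the fit with and without the interaction term. The sketch below runs on simulated data: the `toy` data frame and the effect sizes built into it are hypothetical and are not the avocado data.

```r
# Toy data (hypothetical): balanced month x year design with a built-in
# interaction (prices rise in the second half of 2017 only)
set.seed(42)
toy <- expand.grid(month = factor(1:12), year = factor(2015:2017), rep = 1:15)
toy$total_volume <- runif(nrow(toy), 1e4, 1e6)
toy$average_price <- 1 +
  0.2 * (toy$year == "2017") * (as.integer(toy$month) > 6) -
  1e-7 * toy$total_volume +
  rnorm(nrow(toy), sd = 0.1)

# Same main effects in both models, so the first is nested in the second
mod_base <- lm(average_price ~ month + year + total_volume, data = toy)
mod_int  <- lm(average_price ~ month + year + total_volume + month:year,
               data = toy)

# One F-test for all interaction coefficients jointly
comparison <- anova(mod_base, mod_int)
comparison
```

The comparison is only valid when both models use the same main-effect terms, so on the real data the baseline would need total_volume on the same scale as the interaction model.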

1.3.4.2 month - total_volume

mod_interaction2 <- lm(average_price ~ month + year + total_volume + month:total_volume, data = avocado_tidy_conv)
summary(mod_interaction2)
## 
## Call:
## lm(formula = average_price ~ month + year + total_volume + month:total_volume, 
##     data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74814 -0.15598  0.00139  0.14698  0.90601 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.600e-01  9.488e-03 101.183  < 2e-16 ***
## month10               3.081e-01  1.235e-02  24.945  < 2e-16 ***
## month11               1.795e-01  1.259e-02  14.254  < 2e-16 ***
## month12               3.435e-02  1.259e-02   2.729  0.00637 ** 
## month2               -3.052e-02  1.284e-02  -2.378  0.01745 *  
## month3                9.806e-02  1.259e-02   7.786 7.71e-15 ***
## month4                9.963e-02  1.259e-02   7.916 2.76e-15 ***
## month5                6.587e-02  1.236e-02   5.332 9.98e-08 ***
## month6                1.158e-01  1.286e-02   9.004  < 2e-16 ***
## month7                1.701e-01  1.235e-02  13.779  < 2e-16 ***
## month8                2.064e-01  1.260e-02  16.383  < 2e-16 ***
## month9                2.604e-01  1.285e-02  20.260  < 2e-16 ***
## year2016              2.848e-02  6.058e-03   4.701 2.63e-06 ***
## year2017              2.190e-01  6.036e-03  36.288  < 2e-16 ***
## total_volume         -6.682e-09  1.687e-09  -3.962 7.51e-05 ***
## month10:total_volume  1.257e-09  2.778e-09   0.452  0.65109    
## month11:total_volume -6.043e-10  2.838e-09  -0.213  0.83140    
## month12:total_volume  5.215e-11  2.619e-09   0.020  0.98411    
## month2:total_volume   1.597e-12  2.328e-09   0.001  0.99945    
## month3:total_volume   8.612e-10  2.517e-09   0.342  0.73223    
## month4:total_volume   6.403e-10  2.442e-09   0.262  0.79319    
## month5:total_volume   1.892e-09  2.280e-09   0.830  0.40663    
## month6:total_volume   1.647e-09  2.425e-09   0.679  0.49706    
## month7:total_volume   3.053e-09  2.421e-09   1.261  0.20721    
## month8:total_volume   2.115e-09  2.566e-09   0.824  0.41000    
## month9:total_volume   1.086e-09  2.724e-09   0.399  0.69019    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2265 on 8452 degrees of freedom
## Multiple R-squared:  0.2818, Adjusted R-squared:  0.2797 
## F-statistic: 132.7 on 25 and 8452 DF,  p-value: < 2.2e-16

1.3.4.3 year - total_volume

mod_interaction3 <- lm(average_price ~ month + year + total_volume + year:total_volume, data = avocado_tidy_conv)
summary(mod_interaction3)
## 
## Call:
## lm(formula = average_price ~ month + year + total_volume + year:total_volume, 
##     data = avocado_tidy_conv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74847 -0.15549  0.00075  0.14675  0.90728 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            9.575e-01  9.164e-03 104.480  < 2e-16 ***
## month10                3.100e-01  1.165e-02  26.623  < 2e-16 ***
## month11                1.791e-01  1.188e-02  15.076  < 2e-16 ***
## month12                3.461e-02  1.187e-02   2.916  0.00355 ** 
## month2                -3.073e-02  1.212e-02  -2.535  0.01126 *  
## month3                 9.945e-02  1.188e-02   8.374  < 2e-16 ***
## month4                 1.007e-01  1.187e-02   8.482  < 2e-16 ***
## month5                 6.915e-02  1.165e-02   5.934 3.07e-09 ***
## month6                 1.186e-01  1.212e-02   9.781  < 2e-16 ***
## month7                 1.752e-01  1.164e-02  15.047  < 2e-16 ***
## month8                 2.097e-01  1.188e-02  17.657  < 2e-16 ***
## month9                 2.621e-01  1.212e-02  21.619  < 2e-16 ***
## year2016               2.878e-02  6.414e-03   4.487 7.32e-06 ***
## year2017               2.211e-01  6.387e-03  34.608  < 2e-16 ***
## total_volume          -5.069e-09  9.808e-10  -5.168 2.42e-07 ***
## year2016:total_volume -2.268e-10  1.327e-09  -0.171  0.86424    
## year2017:total_volume -1.328e-09  1.320e-09  -1.005  0.31469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2264 on 8461 degrees of freedom
## Multiple R-squared:  0.2816, Adjusted R-squared:  0.2803 
## F-statistic: 207.3 on 16 and 8461 DF,  p-value: < 2.2e-16
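
The three candidate interactions can also be compared side by side instead of reading three separate summaries. The sketch below fits each candidate from one shared base formula and tabulates adjusted R-squared and AIC. It runs on simulated data: the `toy` data frame, the `candidates` list and the built-in effect are hypothetical.

```r
# Toy data (hypothetical): only the month:year interaction is real here
set.seed(7)
toy <- expand.grid(month = factor(1:12), year = factor(2015:2017), rep = 1:15)
toy$total_volume <- runif(nrow(toy), 1e4, 1e6)
toy$average_price <- 1 +
  0.2 * (toy$year == "2017") * (as.integer(toy$month) > 6) -
  1e-7 * toy$total_volume +
  rnorm(nrow(toy), sd = 0.1)

# One base formula, extended by a single interaction at a time
base_formula <- average_price ~ month + year + total_volume
candidates <- list(
  none               = base_formula,
  month_year         = update(base_formula, . ~ . + month:year),
  month_total_volume = update(base_formula, . ~ . + month:total_volume),
  year_total_volume  = update(base_formula, . ~ . + year:total_volume)
)
fits <- lapply(candidates, lm, data = toy)

# Higher adjusted R-squared and lower AIC both favour a candidate
comparison <- data.frame(
  adj_r_squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic           = sapply(fits, AIC)
)
comparison[order(-comparison$adj_r_squared), ]
```

In the summaries above, the month:year model has the highest adjusted R-squared (0.396, against 0.2797 and 0.2803), which is the kind of ranking this table makes visible at a glance.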
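
# Relative importance (LMG) of month, year and log(total_volume) in the
# main-effects model; rela = TRUE rescales the shares to sum to 100%
relaimpo::calc.relimp(mod_month_year_total_volume, type = "lmg", rela = TRUE)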
## Response variable: average_price 
## Total response variance: 0.0711996 
## Analysis based on 8478 observations 
## 
## 14 Regressors: 
## Some regressors combined in groups: 
##         Group  month : month10 month11 month12 month2 month3 month4 month5 month6 month7 month8 month9 
##         Group  year : year2016 year2017 
## 
##  Relative importance of 3 (groups of) regressors assessed: 
##  month year log(total_volume) 
##  
## Proportion of variance explained by model: 31.86%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                         lmg
## month             0.4257359
## year              0.4196833
## log(total_volume) 0.1545809
## 
## Average coefficients for different model sizes: 
## 
##                        1group     2groups     3groups
## month10            0.31239418  0.30752078  0.30222082
## month11            0.16910969  0.16989142  0.17067942
## month12            0.04044872  0.03545544  0.03032544
## month2            -0.03765432 -0.03312167 -0.02822223
## month3             0.08789886  0.09281706  0.09810356
## month4             0.10540598  0.10314432  0.10098503
## month5             0.05263228  0.06296795  0.07387167
## month6             0.11225309  0.11662733  0.12135449
## month7             0.17554233  0.17558347  0.17562821
## month8             0.19845442  0.20300704  0.20789596
## month9             0.25779321  0.25785426  0.25789072
## year2016           0.02763177  0.03026475  0.03237644
## year2017           0.21692523  0.22012532  0.22297489
## log(total_volume) -0.04544362 -0.04436619 -0.04335656